White wine data set analysis by Erni Durdevic

In this project I wanted to explore how the quality of the wines in a white wine dataset is influenced by their chemical measurements. I’ll explore the dataset looking for the features that have the highest impact on wine quality, and I’ll try to find a linear model that, given a set of wine features, can predict the wine quality.

Univariate Plots Section

To begin, I wanted to explore a summary of the dataset features:

## 'data.frame':    4898 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ score               : Ord.factor w/ 9 levels "1"<"2"<"3"<"4"<..: 6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol         quality          score     
##  Min.   : 8.00   Min.   :3.000   6      :2198  
##  1st Qu.: 9.50   1st Qu.:5.000   5      :1457  
##  Median :10.40   Median :6.000   7      : 880  
##  Mean   :10.51   Mean   :5.878   8      : 175  
##  3rd Qu.:11.40   3rd Qu.:6.000   4      : 163  
##  Max.   :14.20   Max.   :9.000   3      :  20  
##                                  (Other):   5

I noticed that all variables are numeric and that the quality variable could be transformed into an ordered factor, so I added a new variable called score (this code has been moved to the beginning of the file) in order to have both a quality and a score variable for each observation. The summary function gave me a sense of the variables’ distributions, but I’m going to explore all the variables by plotting their distributions:
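The step described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the original code: the six quality values are made up, and the choice of levels 1 to 9 is an assumption based on the "9 levels" reported by the str() output.

```r
# Toy stand-in for the wine data; the real file is not reproduced here.
wines <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 6L))

# Keep the integer column and add an ordered factor next to it.
# Levels 1..9 are an assumption taken from the str() output above.
wines$score <- factor(wines$quality, levels = 1:9, ordered = TRUE)

str(wines)          # quality stays int, score is an Ord.factor
table(wines$score)  # counts per level, including empty levels
```

Keeping both columns means the integer can still be passed to numeric functions such as cor(), while the factor can be used for grouping and coloring.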

## Warning: position_stack requires constant width: output may be incorrect

Univariate Analysis

What is the structure of your dataset?

The data set has 4898 observations of 13 variables (plus the derived score factor):

$ X : int -> Progressive number

$ fixed.acidity : num 3.8 - 14.2

$ volatile.acidity : num 0.08 - 1.1

$ citric.acid : num 0.00 - 1.66

$ residual.sugar : num 0.6 - 65.8

$ chlorides : num 0.009 - 0.346

$ free.sulfur.dioxide : num 2.0 - 289.0

$ total.sulfur.dioxide: num 9.0 - 440.0

$ density : num 0.987 - 1.039

$ pH : num 2.72 - 3.82

$ sulphates : num 0.22 - 1.08

$ alcohol : num 8.0 - 14.2

$ quality : int 3 - 9

The alcohol variable is more widely distributed than the others, almost linearly between 9.9 and 12.
Quality is an integer type, but it can be considered an ordered factor, so I created the “score” ordered factor from the quality value.

What is/are the main feature(s) of interest in your dataset?

The main features I was interested in were quality and alcohol.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All the features are interesting. I suppose that wines with low acidity, chlorides and sulphates will score better than other wines. The distributions above does

Did you create any new variables from existing variables in the dataset?

Yes, I created a “score” variable, which is an ordered factor built from the “quality” variable. Each of the possible quality values (integers from 3 to 9) was transformed into a factor level. I wanted to keep both quality and score, as “int” and “ordered factor” variables respectively, because the integer can be used to calculate correlations while the ordered factor can be used to separate observations into groups.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The alcohol distribution was unusual: it was not Gaussian. The dataset was already in tidy format and I did not have to make adjustments. As described above, I transformed the “quality” integer variable into an ordered factor called “score”.

Bivariate Plots Section

I’ll start by exploring the ggpairs matrix on a sample of 1000 observations from the dataset. I renamed the dataset features in order to make the plot more readable, but I was unable to suppress the warnings or resize the correlation font. I printed the correlation summary after the plot.
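The sampling and renaming steps might look roughly like the sketch below. The data is synthetic, the seed and the short column names are arbitrary choices for illustration, and the knitr call is one possible way to hide messages and warnings in an R Markdown document rather than something the original report used.

```r
set.seed(42)  # arbitrary seed, only so the sample is reproducible

# Synthetic stand-in with the same number of rows as the real data.
wines <- data.frame(
  alcohol = runif(4898, 8, 14.2),
  density = runif(4898, 0.987, 1.039)
)

# Draw 1000 rows without replacement for the pairs plot.
wines_sample <- wines[sample(nrow(wines), 1000), ]

# Shorter (hypothetical) column names keep the panel labels readable.
names(wines_sample) <- c("alc", "dens")

# In a setup chunk, this hides messages and warnings for later chunks.
# It only runs if knitr is installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(message = FALSE, warning = FALSE)
}

# The sample could then be passed to GGally::ggpairs(wines_sample).
dim(wines_sample)
```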

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Since the correlations are unreadable in the ggpairs plot, here is a more readable version:
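A correlation table of this kind can be produced with base R's cor(). The sketch below uses synthetic data with made-up coefficients, so its numbers will not match the table that follows; rounding to three decimals is a suggestion that makes the output easier to scan.

```r
set.seed(1)
n <- 200

# Synthetic columns; the coefficients below are illustrative only.
sugar   <- runif(n, 0.6, 20)
alcohol <- runif(n, 8, 14)
wines <- data.frame(
  residual.sugar = sugar,
  alcohol        = alcohol,
  density        = 0.99 + 0.0004 * sugar - 0.001 * alcohol +
                   rnorm(n, sd = 0.0005),
  quality        = sample(3:9, n, replace = TRUE)
)

# Pearson correlation matrix (the default method of cor()).
corr <- cor(wines)
round(corr, 3)

# Correlations with quality, ordered by absolute size.
q <- corr["quality", colnames(corr) != "quality"]
q[order(abs(q), decreasing = TRUE)]
```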

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates     alcohol      quality
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## quality               0.0994272457  0.053677877  0.43557472  1.000000000

===============================================================

There are too many variables in the ggpairs plot above. Below I selected the most interesting ones from the previous ggpairs plot.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

=======================================================================

There is a high correlation between density and residual sugar (0.839), and I’m curious to see this scatter plot in detail. I filtered the results to exclude outliers and added an alpha to avoid overplotting.
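One way to do this kind of filtering is sketched below. The report does not say which cutoff was used, so dropping everything above the 99th percentile of each variable is an assumption, as are the synthetic data and the alpha value of 0.2. The plotting part only runs if ggplot2 is installed.

```r
set.seed(11)
n <- 500

# Synthetic, right-skewed sugar values and a density that depends on them.
residual.sugar <- rexp(n, rate = 1 / 6) + 0.6
density <- 0.991 + 0.0004 * residual.sugar + rnorm(n, sd = 0.001)
wines <- data.frame(residual.sugar, density)

# Assumed outlier rule: keep rows below the 99th percentile of both variables.
keep <- wines$residual.sugar < quantile(wines$residual.sugar, 0.99) &
        wines$density < quantile(wines$density, 0.99)
trimmed <- wines[keep, ]

# Semi-transparent points make dense regions visible instead of solid blobs.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(trimmed,
                       ggplot2::aes(x = residual.sugar, y = density)) +
    ggplot2::geom_point(alpha = 0.2)
  print(p)
}
```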

==================================================

There is also an interesting correlation (-0.78) between density and alcohol.

===================================================

There is a smaller but still notable correlation between alcohol and residual sugar (-0.451).

In this plot the majority of the points are in the lower-left corner, while there are very few dots in the upper-right corner. There is also a high concentration of dots at very low residual sugar values. This may be because winemakers tend to prolong the fermentation as long as possible, transforming all the sugar into alcohol.

====================================================

Alcohol has the strongest correlation with quality (0.436), followed by density (-0.307) and chlorides (-0.210) while other variables have a lower impact on the quality.

Alcohol by Quality

Density by Quality

Chlorides by Quality

Other less correlated variables

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality is most correlated with alcohol (0.436), followed by density (-0.307), chlorides (-0.21), volatile acidity (-0.195) and total sulfur dioxide (-0.175). Only the correlation with alcohol is moderate; the others are weak.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is an evident correlation between density and residual sugar (0.839). This is due to the process of fermentation that transforms sugar (dense) to alcohol (less dense). This is confirmed by the negative correlation between residual sugar and alcohol (-0.451).

What was the strongest relationship you found?

The strongest relationship I found is between density and residual sugar (0.839). This relationship can be explained by the natural winemaking process, which converts sugar into alcohol.

I also found a notable relationship between alcohol and wine quality (0.436).

Multivariate Plots Section

We are most interested in wine quality, so I’ll use this parameter to color the output and search for patterns.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

========================================================

Let’s have a closer look at alcohol by density and alcohol by residual sugar, colored by quality score.

In the following plots I’ll omit the median quality value (6) to avoid overplotting and to better distinguish the good wines from the bad ones.

As alcohol increases, we get more high-quality wines in both plots. In the first one, we can also see that, as the alcohol concentration increases, the density decreases. In the second one we can see that highly alcoholic wines have less residual sugar.

=============================================================

Residual sugar by Density

In the following plot I’ll avoid plotting the median score value (6).

By coloring the scatter plot by quality, we can notice that better wines have higher residual sugar for the same density values.

===================================================================

Volatile acidity by density

Low density and low volatile acidity both have an impact on wine quality, but there is no particular pattern relating the two factors.

==========================================================

Wine count by alcohol volume and quality

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Wines with a score of 5 or lower are more concentrated at lower alcohol percentages.

===================================================

Linear models

Let’s create a linear model to see if we can predict quality based on the main correlated features.

## 
## Calls:
## m1: lm(formula = (quality ~ alcohol), data = wines)
## m2: lm(formula = quality ~ alcohol + density, data = wines)
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = wines)
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity, 
##     data = wines)
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides, data = wines)
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides + total.sulfur.dioxide, data = wines)
## 
## =======================================================================================
##                           m1         m2         m3         m4         m5         m6    
## ---------------------------------------------------------------------------------------
## (Intercept)            2.582***  -22.492***  90.313***  74.225***  73.271***  81.344***
##                       (0.098)     (6.165)   (12.374)   (11.977)   (11.999)   (12.246)  
## alcohol                0.313***    0.360***   0.246***   0.286***   0.283***   0.284***
##                       (0.009)     (0.015)    (0.018)    (0.018)    (0.018)    (0.018)  
## density                           24.728*** -87.886*** -71.546*** -70.514*** -78.777***
##                                   (6.079)   (12.317)   (11.923)   (11.949)   (12.209)  
## residual.sugar                                0.053***   0.052***   0.052***   0.053***
##                                              (0.005)    (0.005)    (0.005)    (0.005)  
## volatile.acidity                                        -2.059***  -2.044***  -2.077***
##                                                         (0.109)    (0.110)    (0.110)  
## chlorides                                                          -0.692     -0.769   
##                                                                    (0.540)    (0.540)  
## total.sulfur.dioxide                                                           0.001** 
##                                                                               (0.000)  
## ---------------------------------------------------------------------------------------
## R-squared                 0.190      0.192      0.210      0.264      0.264      0.266 
## adj. R-squared            0.190      0.192      0.210      0.263      0.263      0.265 
## sigma                     0.797      0.796      0.787      0.760      0.760      0.759 
## F                      1146.395    583.290    434.085    438.646    351.293    295.042 
## p                         0.000      0.000      0.000      0.000      0.000      0.000 
## Log-likelihood        -5839.391  -5831.127  -5776.812  -5604.126  -5603.301  -5598.094 
## Deviance               3112.257   3101.773   3033.737   2827.187   2826.235   2820.233 
## AIC                   11684.782  11670.255  11563.624  11220.251  11220.603  11212.189 
## BIC                   11704.272  11696.241  11596.107  11259.231  11266.079  11264.161 
## N                      4898       4898       4898       4898       4898       4898     
## =======================================================================================

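Most features slightly increase the accuracy of the model. The exception is chlorides, whose coefficient is not significant and which leaves the R-squared unchanged at 0.264. The overall result is not satisfactory: an R-squared of 0.266 is very low.

There is a good correlation between density, residual sugar and alcohol, so I also modelled density from these two features.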
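The comparison of nested models can be sketched in base R as below. The data is synthetic and its generating coefficients are made up, with chlorides deliberately left as pure noise, so the output only illustrates the mechanics and does not reproduce the table above. The table layout in the report likely came from a dedicated package; here a plain data frame of fit statistics is used instead.

```r
set.seed(7)
n <- 500

# Synthetic predictors; chlorides has no effect on quality in this toy data.
alcohol          <- runif(n, 8, 14)
volatile.acidity <- runif(n, 0.08, 1.1)
chlorides        <- runif(n, 0.009, 0.346)
quality <- 2.5 + 0.3 * alcohol - 2 * volatile.acidity + rnorm(n, sd = 0.8)
wines <- data.frame(quality, alcohol, volatile.acidity, chlorides)

# Build nested models by adding one predictor at a time.
m1 <- lm(quality ~ alcohol, data = wines)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + chlorides)

# Collect the fit statistics side by side.
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit_stats <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
round(fit_stats, 3)

# F-test: does adding chlorides improve on m2?
anova(m2, m3)
```

R-squared can never decrease when a predictor is added to a nested least-squares model, so the adjusted R-squared, AIC or an F-test is a better guide to whether a new feature is worth keeping.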

## 
## Calls:
## m10: lm(formula = (density ~ residual.sugar), data = wines)
## m11: lm(formula = density ~ residual.sugar + alcohol, data = wines)
## 
## =====================================
##                    m10        m11    
## -------------------------------------
## (Intercept)      0.991***   1.005*** 
##                 (0.000)    (0.000)   
## residual.sugar   0.000***   0.000*** 
##                 (0.000)    (0.000)   
## alcohol                    -0.001*** 
##                            (0.000)   
## -------------------------------------
## R-squared            0.704      0.907
## adj. R-squared       0.704      0.907
## sigma                0.002      0.001
## F                11636.984  23791.076
## p                    0.000      0.000
## Log-likelihood   24498.873  27328.019
## Deviance             0.013      0.004
## AIC             -48991.747 -54648.037
## BIC             -48972.257 -54622.051
## N                 4898       4898    
## =====================================

In fact this model is much better, with an R-squared of 0.907. Alcohol concentration and residual sugar are the main factors determining the density.
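As a rough illustration of what the two-predictor fit in m11 does, the sketch below solves the least-squares problem by hand in Python (the analysis itself was run in R). The input rows and the "true" coefficients are made up for the example, and the function name `fit_two_predictors` is hypothetical. The density is generated without noise so that the recovered coefficients can be checked against known values.

```python
# Toy illustration only: the rows and coefficients below are made up,
# they are NOT the fitted values from the wine data set.
residual_sugar = [1.5, 1.6, 6.9, 7.0, 8.5, 12.0, 20.7]
alcohol = [11.0, 9.5, 10.1, 9.6, 9.9, 12.4, 8.8]

# Hypothetical coefficients used to generate a noise-free density.
TRUE_B0, TRUE_SUGAR, TRUE_ALCOHOL = 1.005, 0.0004, -0.001
density = [TRUE_B0 + TRUE_SUGAR * s + TRUE_ALCOHOL * a
           for s, a in zip(residual_sugar, alcohol)]


def fit_two_predictors(x1, x2, y):
    """Ordinary least squares for y ~ x1 + x2; returns (b0, b1, b2, r_squared)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centre every variable so that only a 2x2 system has to be solved.
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12  # zero only if x1 and x2 are collinear
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    fitted = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    ss_tot = sum(c * c for c in dy)
    return b0, b1, b2, 1 - ss_res / ss_tot


b0, b_sugar, b_alcohol, r_squared = fit_two_predictors(
    residual_sugar, alcohol, density)
print(b0, b_sugar, b_alcohol, r_squared)
```

On the real data the same quantities come from the `lm(formula = density ~ residual.sugar + alcohol, data = wines)` call listed above; because real measurements are noisy, R-squared there is 0.907 rather than the perfect fit of this toy example.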

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Yes, in general wines with lower density tend to have higher quality, while residual sugar does not seem to have a clear impact on the quality. Combining residual sugar and density, we can see that for a given density, wines with higher residual sugar have higher quality.

Were there any interesting or surprising interactions between features?

It was interesting how density is correlated with sugar and alcohol content. The longer the fermentation lasts, the lower the residual sugar and the higher the alcohol percentage. The final residual sugar and alcohol percentage are the main factors in the density measurement.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created two models from the dataset.

The first one predicts the quality of the wine from the dataset features. This model was very weak, with an R-squared value of 0.266. This suggests that it is really hard to predict the quality of a wine from objective measurements of its chemical components.

The second model predicts the wine density from residual sugar and alcohol. This model was quite accurate, with an R-squared value of 0.907.


Final Plots and Summary

Plot One

## [1] 0.9258881

Description One

The first plot shows the quality distribution of the wines in the dataset. The dataset contains wines which scored from 3 to 9, in a distribution close to binomial. There are very few wines scoring 3 or 9 quality points, while the vast majority of the wines (92.6 %) score 5, 6 or 7 points.
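The share reported above (0.9258881) can be reproduced with a few lines of arithmetic. The sketch below is in Python rather than the R used for the analysis, and the per-score counts are an assumption: they are the commonly cited tallies for this white wine data set and are not shown in this section.

```python
# Assumed per-score counts for the white wine data set (not shown in this
# section); they add up to the 4898 observations reported in the summary.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

total = sum(quality_counts.values())
middle = sum(quality_counts[score] for score in (5, 6, 7))
share = middle / total

print(total, middle, round(share, 7))
```

With these counts, 4535 of the 4898 wines fall in the 5 to 7 range, which matches the 0.9258881 printed under Plot One.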

Plot Two

## [1] 0.4355747
## [1] 0.4675664
## [1] -0.1321443

Description Two

The exploratory analysis showed that alcohol percentage has an influence on wine quality (the correlation between alcohol and quality is 0.436). To illustrate this relation I created this box plot of the alcohol concentration of the wines for each quality score. Better wines (scoring 7 or above) tend to have a higher alcohol concentration. This almost linear relation between score and alcohol concentration only holds for scores from 5 to 9 (where the correlation between alcohol and score is 0.468); for scores below 5 there is a countertendency (the correlation for scores from 3 to 5 is -0.132). Because of this countertendency the relation is not monotonic: the same alcohol percentage can correspond to both a low and a high score, which makes it difficult to predict the score from the alcohol percentage with a model.
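The three correlations quoted above are the same statistic computed on different subsets of the wines. The Python sketch below shows the idea on a handful of made-up (score, alcohol) pairs that mimic the dip around score 5. The data and the helper names `pearson` and `subset_correlation` are hypothetical, and the resulting numbers are not the ones from the wine data set.

```python
from math import sqrt

# Made-up (score, alcohol) pairs for illustration only.
toy_wines = [
    (3, 10.4), (3, 10.2), (4, 10.3), (4, 9.9), (5, 9.6), (5, 9.9),
    (6, 10.4), (6, 10.8), (7, 11.2), (7, 11.6), (8, 11.8), (8, 11.4),
    (9, 12.2),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def subset_correlation(rows, low, high):
    """Correlation between score and alcohol for rows with low <= score <= high."""
    kept = [(score, alc) for score, alc in rows if low <= score <= high]
    scores = [score for score, _ in kept]
    alcohols = [alc for _, alc in kept]
    return pearson(scores, alcohols)


overall = subset_correlation(toy_wines, 3, 9)
upper = subset_correlation(toy_wines, 5, 9)
lower = subset_correlation(toy_wines, 3, 5)
print(overall, upper, lower)
```

In this toy data the overall and upper-range correlations are positive while the lower-range one is negative, which is the same sign pattern as the 0.436, 0.468 and -0.132 reported for the real wines.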

Plot Three

## Warning: Removed 5 rows containing missing values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: Removed 13 rows containing missing values (geom_path).

Description Three

This scatter plot represents the relation between wine density and residual sugar, colored by wine quality. The regression line represents the linear relation between density and residual sugar. The plot shows that very good wines are more concentrated over the regression line: they tend to have lower density and higher residual sugar. This confirms the previous plot, because a wine needs a high percentage of alcohol to have both high residual sugar and low density.


Reflection

The dataset was tidy and clean, so I had the chance to dig directly into the analysis. The ggpairs plot was very useful in spotting possible correlations between variables and gave me several insights. I struggled somewhat to find the ggpairs documentation and to format the plot for the knitted file.

Other data that would be interesting to analyse are the geographical position (and altitude above sea level) and the production year. I think that these features could play a significant role in determining the wine quality, because altitude and weather can affect the sugar quantity before fermentation, which would lead to a higher final alcohol volume and residual sugar.

The wines dataset shows that wine quality as perceived by people is far more complex than the objective parameters of the wine's chemical composition recorded in the data set. It is not possible to judge wine quality on these parameters alone, but some features do have an impact on the perceived quality of the wine. In general we tend to prefer wines with a high alcohol percentage, while factors like chlorides, volatile acidity and total sulfur dioxide have a negative impact on wine taste.